Abstract

This paper is concerned with screening features in ultrahigh dimensional data analysis,
which has become increasingly important in diverse scientific fields. We develop a
sure independence screening procedure based on the distance correlation (DC-SIS, for
short). The DC-SIS can be implemented as easily as the sure independence screening
procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv
(2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008)
established the sure screening property for the SIS based on linear models, but the
sure screening property is valid for the DC-SIS under more general settings including
linear models. Furthermore, the implementation of the DC-SIS does not require
model specification (e.g., linear model or generalized linear model) for responses or
predictors. This is a very appealing property in ultrahigh dimensional data analysis.
Moreover, the DC-SIS can be used directly to screen grouped predictor variables and
for multivariate response variables. We establish the sure screening property for the
DC-SIS, and conduct simulations to examine its finite sample performance. Numerical
comparison indicates that the DC-SIS performs much better than the SIS in various
models. We also illustrate the DC-SIS through a real data example.

Key words: Distance correlation, sure screening property, ultrahigh dimensionality, variable 
selection. 

Running Head: Distance Correlation Based SIS

1. INTRODUCTION

Various regularization methods have been proposed for feature selection in high dimensional
data analysis, which has become increasingly frequent and important in various research
fields. These methods include, but are not limited to, the LASSO (Tibshirani, 1996),
the SCAD (Fan and Li, 2001; Kim, Choi and Oh, 2008; Zou and Li, 2008), the LARS
algorithm (Efron, Hastie, Johnstone and Tibshirani, 2004), the elastic net (Zou and Hastie, 2005;
Zou and Zhang, 2009), the adaptive LASSO (Zou, 2006) and the Dantzig selector (Candes
and Tao, 2007). All these methods allow the number of predictors to be greater than the
sample size, and perform quite well for high dimensional data.

With the advent of modern technology for data collection, researchers are able to collect
ultrahigh dimensional data at relatively low cost in diverse fields of scientific research. The
aforementioned regularization methods may not perform well for ultrahigh dimensional data
due to the simultaneous challenges of computational expediency, statistical accuracy and
algorithmic stability (Fan, Samworth and Wu, 2009). These challenges call for new statistical
modeling techniques for ultrahigh dimensional data. Fan and Lv (2008) proposed the
SIS and showed that the Pearson correlation ranking procedure possesses a sure screening
property for linear regressions with Gaussian predictors and responses. That is, all truly
important predictors can be selected with probability approaching one as the sample size
diverges to $\infty$. Hall and Miller (2009) extended Pearson correlation learning by considering
polynomial transformations of predictors. To rank the importance of each predictor, they
suggested a bootstrap procedure. Fan, Samworth and Wu (2009) and Fan and Song (2010)
proposed a more general version of independent learning which ranks the maximum marginal
likelihood estimators or the maximum marginal likelihood for generalized linear models. Fan,
Feng and Song (2011) considered nonparametric independence screening in sparse ultrahigh
dimensional additive models. They suggested estimating the nonparametric components
marginally with spline approximation, and ranking the importance of predictors using the
magnitude of nonparametric components. They also demonstrated that this procedure
possesses the sure screening property with vanishing false selection rate. Zhu, Li, Li and Zhu
(2011) proposed a sure independent ranking and screening (SIRS) procedure to screen
significant predictors in multi-index models. They further showed that, under a linearity
condition on the predictor vector, the SIRS enjoys the ranking consistency property (i.e.,
the SIRS can rank the important predictors at the top asymptotically). Ji and Jin (2012)
proposed a two-stage method: screening by Univariate thresholding and cleaning by Penalized
least squares for Selecting variables, namely UPS. They further theoretically demonstrated
that under certain settings, the UPS can outperform the LASSO and subset selection, both
of which are one-stage approaches. This motivates us to develop more effective screening
procedures using two-stage approaches.

In this paper, we propose a new feature screening procedure for ultrahigh dimensional
data based on distance correlation. Szekely, Rizzo and Bakirov (2007) and Szekely and Rizzo
(2009) showed that the distance correlation of two random vectors equals zero if and only
if these two random vectors are independent. Furthermore, the distance correlation of two
univariate normal random variables is a strictly increasing function of the absolute value
of the Pearson correlation of these two normal random variables. These two remarkable
properties motivate us to use the distance correlation for feature screening in ultrahigh
dimensional data. We refer to our Sure Independence Screening procedure based on the
Distance Correlation as the DC-SIS. The DC-SIS can be implemented as easily as the SIS.
It is equivalent to the SIS when both the response and predictor variables are normally
distributed. However, the DC-SIS has appealing features that existing screening procedures
including SIS do not possess. For instance, none of the aforementioned screening procedures
can handle grouped predictors or multivariate responses. The proposed DC-SIS can be
directly employed for screening grouped variables, and it can be directly utilized for ultrahigh
dimensional data with multivariate responses. Feature screening for multivariate responses
and/or grouped predictors is of great interest in pathway analyses. As in Chen, et al. (2011),
pathway here means sets of proteins that are relevant to specific biological functions without
regard to the state of knowledge concerning the interplay among such proteins. Since proteins
may work interactively to perform various biological functions, pathway analyses complement
the marginal association analyses for individual proteins, and aim to detect a priori defined
sets of proteins that are associated with phenotypes of interest. There has been a surge of
interest in pathway analyses in the recent literature (Ashburner, et al., 2000; Mootha, et al., 2003;
Subramanian, et al., 2005; Tian, et al., 2005; Bild, et al., 2006; Efron and Tibshirani, 2007;
Jones, et al., 2008). Thus, it is of importance to develop feature screening procedures for
multivariate responses and/or grouped predictors.

We systematically study the theoretical properties of the DC-SIS, and prove that the DC-SIS
possesses the sure screening property in the terminology of Fan and Lv (2008) under
very general model settings including linear regression models, for which Fan and Lv (2008)
established the sure screening property of the SIS. The sure screening property is a desirable
property for feature screening in ultrahigh dimensional data. More importantly, the DC-SIS
can be used for screening features without specifying a regression model between the
response and the predictors. Compared with the model-based screening procedures (Fan
and Lv, 2008; Fan, Samworth and Wu, 2009; Wang, 2009; Fan and Song, 2010; Fan, Feng
and Song, 2011), the DC-SIS is a model-free screening procedure. This virtue makes the
proposed procedure robust to model mis-specification. This is a very appealing feature of
the proposed procedure in that it may be very difficult to specify an appropriate regression
model for the response and the predictors with little information about the actual model in
ultrahigh dimensional data.

We conduct Monte Carlo simulation studies to numerically compare the DC-SIS with the
SIS and SIRS. Our simulation results indicate that the DC-SIS can significantly outperform
the SIS and the SIRS under many model settings. We also assess the performance of the
DC-SIS as a grouped variable screener, and the simulation results show that the DC-SIS
performs very well. We further examine the performance of the DC-SIS for feature screening
in ultrahigh dimensional data with multivariate responses; simulation results demonstrate
that screening features for multiple responses jointly can have a dramatic advantage over
screening features with each response separately.

The rest of this paper is organized as follows. In Section 2, we develop the DC-SIS for 
feature screening and establish its sure screening property. In Section 3, we examine the 
finite sample performance of the DC-SIS via Monte Carlo simulations. We also illustrate 
the proposed methodology through a real data example. This paper concludes with a brief 
discussion in Section 4. All technical proofs are given in the Appendix. 

2. INDEPENDENCE SCREENING USING DISTANCE CORRELATION



2.1. Some Preliminaries 



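Szekely, Rizzo and Bakirov (2007) advocated using the distance correlation for measuring
dependence between two random vectors. To be precise, let $\phi_{\mathbf{u}}(\mathbf{t})$ and
$\phi_{\mathbf{v}}(\mathbf{s})$ be the respective characteristic functions of the random vectors
$\mathbf{u}$ and $\mathbf{v}$, and $\phi_{\mathbf{u},\mathbf{v}}(\mathbf{t},\mathbf{s})$ be the joint
characteristic function of $\mathbf{u}$ and $\mathbf{v}$. They defined the distance covariance between
$\mathbf{u}$ and $\mathbf{v}$ with finite first moments to be the nonnegative number
$\mathrm{dcov}(\mathbf{u},\mathbf{v})$ given by

$$\mathrm{dcov}^2(\mathbf{u},\mathbf{v}) = \int_{\mathbb{R}^{d_u+d_v}} \left\| \phi_{\mathbf{u},\mathbf{v}}(\mathbf{t},\mathbf{s}) - \phi_{\mathbf{u}}(\mathbf{t})\,\phi_{\mathbf{v}}(\mathbf{s}) \right\|^2 w(\mathbf{t},\mathbf{s})\, d\mathbf{t}\, d\mathbf{s}, \tag{2.1}$$

where $d_u$ and $d_v$ are the dimensions of $\mathbf{u}$ and $\mathbf{v}$, respectively, and

$$w(\mathbf{t},\mathbf{s}) = \left\{ c_{d_u} c_{d_v} \|\mathbf{t}\|_{d_u}^{1+d_u} \|\mathbf{s}\|_{d_v}^{1+d_v} \right\}^{-1}$$

with $c_d = \pi^{(1+d)/2} / \Gamma\{(1+d)/2\}$. Throughout this paper, $\|\mathbf{a}\|_d$ stands for the
Euclidean norm of $\mathbf{a} \in \mathbb{R}^d$, and $\|\phi\|^2 = \phi\bar{\phi}$ for a complex-valued
function $\phi$ with $\bar{\phi}$ being the conjugate of $\phi$.
The distance correlation (DC) between $\mathbf{u}$ and $\mathbf{v}$ with finite first moments is defined as

$$\mathrm{dcorr}(\mathbf{u},\mathbf{v}) = \frac{\mathrm{dcov}(\mathbf{u},\mathbf{v})}{\sqrt{\mathrm{dcov}(\mathbf{u},\mathbf{u})\,\mathrm{dcov}(\mathbf{v},\mathbf{v})}}. \tag{2.2}$$
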
Szekely, Rizzo and Bakirov (2007) systematically studied the theoretic properties of the DC. 



Two remarkable properties of the DC motivate us to utilize it in a feature screening
procedure. The first one is the relationship between the DC and the Pearson correlation
coefficient. For two univariate normal random variables $U$ and $V$ with the Pearson correlation
coefficient $\rho$, Szekely, Rizzo and Bakirov (2007) and Szekely and Rizzo (2009) showed that

$$\mathrm{dcorr}(U,V) = \left\{ \frac{\rho\arcsin(\rho) + \sqrt{1-\rho^2} - \rho\arcsin(\rho/2) - \sqrt{4-\rho^2} + 1}{1 + \pi/3 - \sqrt{3}} \right\}^{1/2}, \tag{2.3}$$

which is strictly increasing in $|\rho|$. This property implies that the DC-based feature screening
procedure is equivalent to the marginal Pearson correlation learning for linear regression
with normally distributed predictors and random error. In such a situation, Fan and Lv
(2008) showed that the Pearson correlation learning has the sure screening property.
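As a quick numerical check of (2.3), the following minimal Python sketch evaluates the right-hand side on a grid of $\rho$ values. The function name `dcorr_from_pearson` is our own choice for illustration and does not come from any library; only the standard library is assumed.

```python
import math


def dcorr_from_pearson(rho: float) -> float:
    """Right-hand side of (2.3): DC of a bivariate normal pair with correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    numerator = (
        rho * math.asin(rho)
        + math.sqrt(1.0 - rho ** 2)
        - rho * math.asin(rho / 2.0)
        - math.sqrt(4.0 - rho ** 2)
        + 1.0
    )
    denominator = 1.0 + math.pi / 3.0 - math.sqrt(3.0)
    # Clamp at zero to absorb floating-point rounding when rho is close to 0.
    return math.sqrt(max(numerator, 0.0) / denominator)


# Evaluate on rho = 0.0, 0.1, ..., 1.0 to inspect the monotone relationship.
grid = [i / 10 for i in range(11)]
values = [dcorr_from_pearson(r) for r in grid]
```

On this grid the values run from 0 at $\rho = 0$ to 1 at $\rho = 1$, so ranking predictors by the DC or by $|\rho|$ gives the same ordering in the bivariate normal case.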



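The second remarkable property of the DC is that $\mathrm{dcorr}(\mathbf{u},\mathbf{v}) = 0$ if and only
if $\mathbf{u}$ and $\mathbf{v}$ are independent (Szekely, Rizzo and Bakirov, 2007). We note that two
univariate random variables $U$ and $V$ are independent if and only if $U$ and $T(V)$, a strictly
monotone transformation of $V$, are independent. This implies that a DC-based feature screening
procedure can be more effective than the marginal Pearson correlation learning in the presence of
a nonlinear relationship between $U$ and $V$. We will demonstrate in the next section that a DC-based
screening procedure is a model-free procedure in that one does not need to specify a model
structure between the predictors and the response.

Szekely, Rizzo and Bakirov (2007, Remark 3) stated that

$$\mathrm{dcov}^2(\mathbf{u},\mathbf{v}) = S_1 + S_2 - 2S_3,$$

where $S_j$, $j = 1, 2$ and $3$, are defined below:

$$\begin{aligned}
S_1 &= E\left\{ \|\mathbf{u}-\tilde{\mathbf{u}}\|_{d_u} \|\mathbf{v}-\tilde{\mathbf{v}}\|_{d_v} \right\}, \\
S_2 &= E\left\{ \|\mathbf{u}-\tilde{\mathbf{u}}\|_{d_u} \right\} E\left\{ \|\mathbf{v}-\tilde{\mathbf{v}}\|_{d_v} \right\}, \\
S_3 &= E\left\{ E\left( \|\mathbf{u}-\tilde{\mathbf{u}}\|_{d_u} \mid \mathbf{u} \right) E\left( \|\mathbf{v}-\tilde{\mathbf{v}}\|_{d_v} \mid \mathbf{v} \right) \right\},
\end{aligned} \tag{2.4}$$

where $(\tilde{\mathbf{u}}, \tilde{\mathbf{v}})$ is an independent copy of $(\mathbf{u},\mathbf{v})$.

Suppose that $\{(\mathbf{u}_i, \mathbf{v}_i), i = 1, \ldots, n\}$ is a random sample from the population
$(\mathbf{u}, \mathbf{v})$. Szekely, Rizzo and Bakirov (2007) proposed to estimate $S_1$, $S_2$ and $S_3$
through the usual moment estimation. To be precise,

$$\begin{aligned}
\hat{S}_1 &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|\mathbf{u}_i-\mathbf{u}_j\|_{d_u} \|\mathbf{v}_i-\mathbf{v}_j\|_{d_v}, \\
\hat{S}_2 &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|\mathbf{u}_i-\mathbf{u}_j\|_{d_u} \; \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|\mathbf{v}_i-\mathbf{v}_j\|_{d_v}, \\
\hat{S}_3 &= \frac{1}{n^3} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{l=1}^{n} \|\mathbf{u}_i-\mathbf{u}_l\|_{d_u} \|\mathbf{v}_j-\mathbf{v}_l\|_{d_v}.
\end{aligned}$$

Thus, a natural estimator of $\mathrm{dcov}^2(\mathbf{u},\mathbf{v})$ is given by

$$\widehat{\mathrm{dcov}}^2(\mathbf{u},\mathbf{v}) = \hat{S}_1 + \hat{S}_2 - 2\hat{S}_3.$$

Similarly, we can define the sample distance covariances $\widehat{\mathrm{dcov}}(\mathbf{u},\mathbf{u})$ and
$\widehat{\mathrm{dcov}}(\mathbf{v},\mathbf{v})$. Accordingly, the sample distance correlation between
$\mathbf{u}$ and $\mathbf{v}$ can be defined by

$$\widehat{\mathrm{dcorr}}(\mathbf{u},\mathbf{v}) = \frac{\widehat{\mathrm{dcov}}(\mathbf{u},\mathbf{v})}{\sqrt{\widehat{\mathrm{dcov}}(\mathbf{u},\mathbf{u})\,\widehat{\mathrm{dcov}}(\mathbf{v},\mathbf{v})}}.$$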
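The moment estimators above translate directly into code. The following is a minimal, unoptimized Python sketch under our own naming (`sample_dcov2` and `sample_dcorr` are hypothetical helper names, not functions of an existing package). It stores two n-by-n distance matrices, so it is meant for small n only, and it assumes Python 3.8 or later for `math.dist`.

```python
import math


def _as_points(sample):
    """Wrap scalar observations as 1-tuples so that math.dist applies uniformly."""
    return [tuple(x) if isinstance(x, (list, tuple)) else (float(x),) for x in sample]


def _distance_matrix(points):
    return [[math.dist(a, b) for b in points] for a in points]


def sample_dcov2(u, v):
    """Moment estimate S1_hat + S2_hat - 2 * S3_hat of dcov^2(u, v)."""
    a = _distance_matrix(_as_points(u))
    b = _distance_matrix(_as_points(v))
    n = len(a)
    s1 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    s2 = (sum(map(sum, a)) / n ** 2) * (sum(map(sum, b)) / n ** 2)
    # S3_hat = n^-3 * sum_l (sum_i a[i][l]) * (sum_j b[j][l]); the matrices are
    # symmetric, so row sums can stand in for column sums.
    s3 = sum(sum(ra) * sum(rb) for ra, rb in zip(a, b)) / n ** 3
    return max(s1 + s2 - 2.0 * s3, 0.0)  # clamp tiny negative rounding errors


def sample_dcorr(u, v):
    """Sample distance correlation; returns 0.0 if either sample is constant."""
    denom = math.sqrt(sample_dcov2(u, u) * sample_dcov2(v, v))
    if denom == 0.0:
        return 0.0
    # dcorr = dcov(u, v) / sqrt(dcov(u, u) * dcov(v, v)), with dcov = sqrt(dcov^2).
    return math.sqrt(sample_dcov2(u, v) / denom)


linear = sample_dcorr([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])      # exact linear relation
quadratic = sample_dcorr([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # Pearson correlation is 0
```

In the second call the Pearson correlation is exactly zero by symmetry, while the sample distance correlation is clearly positive, which illustrates why the DC can pick up nonlinear dependence.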



2.2. An Independence Ranking and Screening Procedure 

In this section we propose an independence screening procedure built upon the DC. Let
$\mathbf{y} = (Y_1, \ldots, Y_q)^T$ be the response vector with support $\Psi_y$, and
$\mathbf{x} = (X_1, \ldots, X_p)^T$ be the predictor vector. We regard $q$ as a fixed number in this
context. In an ultrahigh-dimensional setting the dimensionality $p$ greatly exceeds the sample size $n$.
It is thus natural to assume that only a small number of predictors are relevant to $\mathbf{y}$. Denote
by $F(\mathbf{y} \mid \mathbf{x})$ the conditional distribution function of $\mathbf{y}$ given $\mathbf{x}$.
Without specifying a regression model, we define the index sets of the active and inactive predictors by

$$\begin{aligned}
\mathcal{D} &= \{k : F(\mathbf{y} \mid \mathbf{x}) \text{ functionally depends on } X_k \text{ for some } \mathbf{y} \in \Psi_y\}, \\
\mathcal{I} &= \{k : F(\mathbf{y} \mid \mathbf{x}) \text{ does not functionally depend on } X_k \text{ for any } \mathbf{y} \in \Psi_y\}.
\end{aligned} \tag{2.5}$$

We further write $\mathbf{x}_{\mathcal{D}} = \{X_k : k \in \mathcal{D}\}$ and
$\mathbf{x}_{\mathcal{I}} = \{X_k : k \in \mathcal{I}\}$, and refer to $\mathbf{x}_{\mathcal{D}}$ as an active
predictor vector and its complement $\mathbf{x}_{\mathcal{I}}$ as an inactive predictor vector. The index
subset $\mathcal{D}$ of all active predictors or, equivalently, the index subset $\mathcal{I}$ of all inactive
predictors, is the objective of our primary interest. Definition (2.5) implies that
$\mathbf{y} \perp\!\!\!\perp \mathbf{x}_{\mathcal{I}} \mid \mathbf{x}_{\mathcal{D}}$, where $\perp\!\!\!\perp$ denotes
statistical independence. That is, given $\mathbf{x}_{\mathcal{D}}$, the remaining predictors
$\mathbf{x}_{\mathcal{I}}$ are independent of $\mathbf{y}$. Thus the inactive predictors
$\mathbf{x}_{\mathcal{I}}$ are redundant when the active predictors $\mathbf{x}_{\mathcal{D}}$ are known.

For ease of presentation, we write

$$\omega_k = \mathrm{dcorr}^2(X_k, \mathbf{y}), \quad \text{and} \quad \hat{\omega}_k = \widehat{\mathrm{dcorr}}^2(X_k, \mathbf{y}), \quad \text{for } k = 1, \ldots, p,$$

based on a random sample $\{\mathbf{x}_i, \mathbf{y}_i\}$, $i = 1, \ldots, n$. We consider using $\omega_k$ as a
marginal utility to rank the importance of $X_k$ at the population level. We utilize the DC because it
allows for arbitrary regression relationship of $\mathbf{y}$ onto $\mathbf{x}$, regardless of whether it is
linear or nonlinear. The DC also permits univariate and multivariate response, regardless of whether
it is continuous, discrete or categorical. In addition, it allows for groupwise predictors. Thus,
this DC based screening procedure is completely model-free. We select a set of important
predictors with large $\hat{\omega}_k$. That is, we define

$$\hat{\mathcal{D}}^\star = \{k : \hat{\omega}_k \ge c n^{-\kappa}, \text{ for } 1 \le k \le p\},$$

where $c$ and $\kappa$ are pre-specified threshold values which will be defined in condition (C2) in
the subsequent section.
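Once the marginal utilities have been estimated, the selection step is a simple thresholding or ranking operation. The sketch below assumes the estimates $\hat{\omega}_k$ are already available as a Python list; the function names, the example utilities and the constants `c = 1` and `kappa = 0.5` are illustrative assumptions, not values prescribed by the procedure. Keeping the d top-ranked predictors is the variant used with a user-specified model size.

```python
def screen_by_threshold(omega_hat, n, c, kappa):
    """1-based indices k with omega_hat_k >= c * n**(-kappa)."""
    cutoff = c * n ** (-kappa)
    return [k for k, w in enumerate(omega_hat, start=1) if w >= cutoff]


def screen_top_d(omega_hat, d):
    """1-based indices of the d largest utilities, most important first."""
    order = sorted(range(1, len(omega_hat) + 1), key=lambda k: -omega_hat[k - 1])
    return order[:d]


# Hypothetical utilities for p = 5 predictors and n = 100 observations.
omega_hat = [0.40, 0.02, 0.15, 0.01, 0.30]
selected = screen_by_threshold(omega_hat, n=100, c=1.0, kappa=0.5)
top_two = screen_top_d(omega_hat, d=2)
```

With these illustrative numbers the cutoff is 0.1, so the thresholding rule keeps predictors 1, 3 and 5, while the top-two rule keeps predictors 1 and 5.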

2.3. Theoretical Properties 

Next we study the theoretical properties of the proposed independence screening procedure 
built upon the DC. The following conditions are imposed to facilitate the technical proofs, 
although they may not be the weakest ones. 

(C1) Both $\mathbf{x}$ and $\mathbf{y}$ satisfy the sub-exponential tail probability uniformly in $p$. That
is, there exists a positive constant $s_0$ such that for all $0 < s \le 2s_0$,

$$\sup_{p} \max_{1 \le k \le p} E\left\{ \exp\left( s \|X_k\|_1^2 \right) \right\} < \infty, \quad \text{and} \quad E\left\{ \exp\left( s \|\mathbf{y}\|_q^2 \right) \right\} < \infty.$$

(C2) The minimum distance correlation of active predictors satisfies

$$\min_{k \in \mathcal{D}} \omega_k \ge 2c n^{-\kappa}, \quad \text{for some constants } c > 0 \text{ and } 0 \le \kappa < 1/2.$$

Condition (C1) follows immediately when $\mathbf{x}$ and $\mathbf{y}$ are bounded uniformly, or when they
have multivariate normal distribution. The normality assumption has been widely used in
the area of ultrahigh dimensional data analysis to facilitate the technical derivations. See,
for example, Fan and Lv (2008) and Wang (2009).

Next we explore condition (C2). When $\mathbf{x}$ and $\mathbf{y}$ have multivariate normal distribution,
(2.3) gives an explicit relationship between the DC and the squared Pearson correlation. For
simplicity, we write $\mathrm{dcorr}(X_k, Y) = T_0(|\rho(X_k, Y)|)$, where $T_0(\cdot)$ is the strictly increasing
function given in (2.3). In this situation, condition (C2) requires essentially that
$\min_{k \in \mathcal{D}} |\rho(X_k, Y)| \ge T_{\mathrm{inv}}(2cn^{-\kappa})$, where $T_{\mathrm{inv}}(\cdot)$ is the
inverse function of $T_0(\cdot)$. This is parallel to condition 3 of Fan and Lv (2008), where it is assumed
that $\min_{k \in \mathcal{D}} |\rho(X_k, Y)| \ge 2cn^{-\kappa}$. This intuitive illustration implies
that condition (C2) requires that the marginal DC of active predictors cannot be too small,
which is similar to condition 3 of Fan and Lv (2008). We remark here that, although
we illustrate the intuition by assuming that $\mathbf{x}$ and $\mathbf{y}$ are multivariate normal, we do not
require this assumption explicitly in our context. The following theorem establishes the sure
screening property for the DC-SIS procedure.



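Theorem 1. Under condition (C1), for any $0 < \gamma < 1/2 - \kappa$, there exist positive constants
$c_1 > 0$ and $c_2 > 0$ such that

$$\Pr\left( \max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| \ge c n^{-\kappa} \right) \le O\left( p \left[ \exp\left\{ -c_1 n^{1-2(\kappa+\gamma)} \right\} + n \exp\left( -c_2 n^{\gamma} \right) \right] \right). \tag{2.6}$$

Under conditions (C1) and (C2), we have that

$$\Pr\left( \mathcal{D} \subseteq \hat{\mathcal{D}}^\star \right) \ge 1 - O\left( s_n \left[ \exp\left\{ -c_1 n^{1-2(\kappa+\gamma)} \right\} + n \exp\left( -c_2 n^{\gamma} \right) \right] \right), \tag{2.7}$$

where $s_n$ is the cardinality of $\mathcal{D}$.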

The sure screening property holds for the DC-SIS under milder conditions than those for
the SIS (Fan and Lv, 2008) in that we do not require the regression function of $\mathbf{y}$ onto $\mathbf{x}$
to be linear. Thus, the DC-SIS provides a unified alternative to existing model-based sure
screening procedures. Compared with the SIRS, the DC-SIS can effectively handle grouped
predictors and multivariate responses.

To balance the two terms on the right hand side of (2.6), we choose the optimal order
$\gamma = (1 - 2\kappa)/3$; the first part of Theorem 1 then becomes

$$\Pr\left( \max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| \ge c n^{-\kappa} \right) \le O\left( p \left[ \exp\left\{ -c_1 n^{(1-2\kappa)/3} \right\} \right] \right),$$

for some constant $c_1 > 0$, indicating that we can handle the NP-dimensionality of order
$\log p = o\left( n^{(1-2\kappa)/3} \right)$. If we further assume that $X_k$ and $\mathbf{y}$ are bounded uniformly
in $p$, then we can obtain without much difficulty that

$$\Pr\left( \max_{1 \le k \le p} |\hat{\omega}_k - \omega_k| \ge c n^{-\kappa} \right) \le O\left( p \left[ \exp\left\{ -c_1 n^{1-2\kappa} \right\} \right] \right).$$

In this case, we can handle the NP-dimensionality $\log p = o\left( n^{1-2\kappa} \right)$.
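The choice of $\gamma$ above can be read as equating the two polynomial rates inside the exponentials in (2.6). The short derivation below is our own restatement of that balancing step; the value $\kappa = 1/4$ in the last line is an arbitrary illustration.

```latex
% Balancing the two exponential rates on the right-hand side of (2.6).
\begin{align*}
1 - 2(\kappa + \gamma) = \gamma
  \quad &\Longrightarrow \quad \gamma = \frac{1 - 2\kappa}{3}, \\
n \exp\left(-c_2 n^{\gamma}\right)
  &= \exp\left(-c_2 n^{\gamma} + \log n\right), \\
\kappa = \tfrac{1}{4}
  \quad &\Longrightarrow \quad \gamma = \tfrac{1}{6}, \qquad
  1 - 2(\kappa + \gamma) = 1 - \tfrac{5}{6} = \tfrac{1}{6}.
\end{align*}
```

The second line shows why the extra factor $n$ is harmless: $\log n$ is of smaller order than $n^{\gamma}$, so it can be absorbed by slightly reducing the constant in the exponent.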



3. NUMERICAL STUDIES 

In this section we assess the performance of the DC-SIS by Monte Carlo simulation. Our 
simulation studies were conducted using R code. We further illustrate the proposed screening 
procedure with an empirical analysis of a real data example. 

In Examples 1, 2 and 3, we generate $\mathbf{x} = (X_1, X_2, \ldots, X_p)^T$ from a normal distribution
with zero mean and covariance matrix $\Sigma = (\sigma_{ij})_{p \times p}$, and the error term $\varepsilon$
from the standard normal distribution $\mathcal{N}(0, 1)$. We consider two covariance matrices to assess the
performance of the DC-SIS and to compare with existing methods: (i) $\sigma_{ij} = 0.8^{|i-j|}$ and
(ii) $\sigma_{ij} = 0.5^{|i-j|}$.

We fix the sample size $n$ to be 200 and vary the dimension $p$ from 2,000 to 5,000. We
repeat each experiment 500 times, and evaluate the performance through the following three
criteria.

1. $S$: the minimum model size to include all active predictors. We report the 5%, 25%,
50%, 75% and 95% quantiles of $S$ out of 500 replications.

2. $\mathcal{P}_s$: the proportion that an individual active predictor is selected for a given model size
$d$ in the 500 replications.

3. $\mathcal{P}_a$: the proportion that all active predictors are selected for a given model size $d$ in
the 500 replications.

The criterion $S$ measures the complexity of the model produced by a screening procedure. The
closer $S$ is to the number of truly active predictors, the better the screening procedure is.
The sure screening property ensures that $\mathcal{P}_s$ and $\mathcal{P}_a$ are both close to one
when the estimated model size $d$ is sufficiently large. We choose $d$ to be $d_1 = [n/\log n]$,
$d_2 = 2[n/\log n]$ and $d_3 = 3[n/\log n]$ throughout our simulations to empirically examine the
effect of the cutoff, where $[a]$ denotes the integer part of $a$.
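To make the criteria concrete, the following minimal Python sketch computes the three cutoffs for a given n and the minimum model size S for one replication. The function names and the example utilities are hypothetical and serve only as an illustration.

```python
import math


def model_sizes(n):
    """(d1, d2, d3) = ([n/log n], 2[n/log n], 3[n/log n]); [a] is the integer part of a."""
    base = int(n / math.log(n))
    return base, 2 * base, 3 * base


def minimum_model_size(utilities, active):
    """Smallest d such that the d top-ranked predictors contain all active ones.

    `utilities` holds one marginal utility per predictor; `active` holds the
    1-based indices of the truly active predictors.
    """
    order = sorted(range(1, len(utilities) + 1), key=lambda k: -utilities[k - 1])
    rank = {k: position for position, k in enumerate(order, start=1)}
    return max(rank[k] for k in active)


d1, d2, d3 = model_sizes(200)
# Hypothetical utilities for five predictors, of which predictors 1 and 3 are active.
s = minimum_model_size([0.9, 0.1, 0.5, 0.3, 0.8], active=[1, 3])
```

For n = 200 the cutoffs are 37, 74 and 111. In the toy ranking, predictor 3 is ranked third, so S = 3. An individual active predictor counts toward $\mathcal{P}_s$ in a replication when its rank is at most d, and the replication counts toward $\mathcal{P}_a$ when S is at most d.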

Example 1. This example is designed to compare the finite sample performance of the
DC-SIS with the SIS (Fan and Lv, 2008) and SIRS (Zhu, Li, Li and Zhu, 2011). In this
example, we generate the response from the following four models:

$$\begin{aligned}
\text{(1.a):} \quad Y &= c_1\beta_1 X_1 + c_2\beta_2 X_2 + c_3\beta_3 1(X_{12} < 0) + c_4\beta_4 X_{22} + \varepsilon, \\
\text{(1.b):} \quad Y &= c_1\beta_1 X_1 X_2 + c_3\beta_2 1(X_{12} < 0) + c_4\beta_3 X_{22} + \varepsilon, \\
\text{(1.c):} \quad Y &= c_1\beta_1 X_1 X_2 + c_3\beta_2 1(X_{12} < 0) X_{22} + \varepsilon, \\
\text{(1.d):} \quad Y &= c_1\beta_1 X_1 + c_2\beta_2 X_2 + c_3\beta_3 1(X_{12} < 0) + \exp(c_4 |X_{22}|)\varepsilon,
\end{aligned}$$




where $1(X_{12} < 0)$ is an indicator function. The regression functions $E(Y \mid \mathbf{x})$ in models
(1.a)-(1.d) are all nonlinear in $X_{12}$. In addition, models (1.b) and (1.c) contain an interaction
term $X_1 X_2$, and model (1.d) is heteroscedastic. Following Fan and Lv (2008), we
choose $\beta_j = (-1)^U (a + |Z|)$ for $j = 1, 2, 3$ and $4$, where $a = 4\log n/\sqrt{n}$,
$U \sim \mathrm{Bernoulli}(0.4)$ and $Z \sim \mathcal{N}(0, 1)$. We set $(c_1, c_2, c_3, c_4) = (2, 0.5, 3, 2)$
in this example to challenge the feature screening procedures under consideration. For each
independence screening procedure, we compute the associated marginal utility between each predictor
and the response $Y$. That is, we regard $\mathbf{x} = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ as the
predictor vector in this example.

Table 1: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S out of 500
replications in Example 1.

                          SIS                                       SIRS                                     DC-SIS
Model         5%     25%     50%     75%     95%        5%     25%     50%     75%     95%        5%     25%     50%     75%     95%
case 1: p = 2000 and $\sigma_{ij} = 0.5^{|i-j|}$
(1.a)         --      --      --      --      --        --      --      --      --      --        --     4.0     4.0     6.0    18.0
(1.b)       68.0   578.5  1180.5  1634.5  1938.0     232.9   871.5  1386.0  1725.2  1942.4       5.0     9.0    24.5    73.0   345.1
(1.c)      395.9  1037.2      --      --      --        --      --      --      --      --        --      --      --      --      --
(1.d)      130.5   611.2  1166.0  1637.0  1936.5      42.0   304.2   797.0  1432.2  1846.1       4.0     5.0     9.0    41.0   336.2
case 2: p = 2000 and $\sigma_{ij} = 0.8^{|i-j|}$
(1.a)         --      --      --      --      --        --      --      --      --      --        --      --      --      --      --
(1.b)         --      --      --      --      --        --      --      --      --      --        --      --      --      --      --
(1.c)      224.5   775.2      --      --      --        --      --      --      --      --        --      --      --      --      --
(1.d)       79.0   583.8  1107.5  1626.2  1930.0      50.9   300.5   728.0  1368.2  1900.1       4.0     7.0    17.0    73.2   653.1
case 3: p = 5000 and $\sigma_{ij} = 0.5^{|i-j|}$
(1.a)        4.0     4.0     5.0     6.0    59.0       4.0     4.0     5.0     7.0    88.4       4.0     4.0     4.0     6.0    34.1
(1.b)      165.1  1112.5  2729.0  3997.2  4851.5     560.8  1913.0  3249.0  4329.0  4869.1       5.0    11.8    45.0   168.8   956.7
(1.c)     1183.7  2712.0  3604.5  4380.2  4885.0     440.4  1949.0  3205.5  4242.8  4883.1       7.0    17.0    53.0   179.5   732.0
(1.d)      259.9  1338.5  2808.5  3990.8  4764.9     118.7   823.2  1833.5  3314.5  4706.1       4.0     5.0    15.0    77.2   848.2
case 4: p = 5000 and $\sigma_{ij} = 0.8^{|i-j|}$
(1.a)        5.0    10.0    26.5   251.5  2522.7       5.0    10.0    28.0   324.8  3246.4       5.0     8.0    14.0    69.0  1455.1
(1.b)       40.7   639.8  2072.0  3803.8  4801.7     215.7  1677.8  3010.0  4352.2  4934.1       5.0     8.0    11.0    21.0   162.0
(1.c)      479.2  1884.8  3347.5  4298.5  4875.2     297.7  1359.2  2738.5  4072.5  4877.6       8.0    12.0    22.0    83.0   657.9
(1.d)      307.0  1544.0  2832.5  4026.2  4785.2     148.2   672.0  1874.0  3330.0  4665.2       4.0     7.0    21.0   165.2  1330.0
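The data-generating mechanism of Example 1 can be sketched in a few lines. The Python sketch below simulates model (1.a) only and relies on two assumptions of ours: the covariance $\sigma_{ij} = \rho^{|i-j|}$ is produced through an AR(1) recursion, and the coefficients $\beta_j$ are drawn once per replication and shared by all n observations. The function names are hypothetical, and the small p in the usage line is chosen for speed rather than to match the paper's settings.

```python
import math
import random


def draw_predictors(p, rho, rng):
    """One draw of x ~ N(0, Sigma) with sigma_ij = rho**|i-j|, via an AR(1) recursion."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho ** 2)
    for _ in range(p - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def draw_beta(n, rng):
    """beta_j = (-1)^U (a + |Z|), a = 4 log n / sqrt(n), U ~ Bernoulli(0.4), Z ~ N(0, 1)."""
    a = 4.0 * math.log(n) / math.sqrt(n)
    sign = -1.0 if rng.random() < 0.4 else 1.0
    return sign * (a + abs(rng.gauss(0.0, 1.0)))


def simulate_model_1a(n, p, rho, seed=0):
    """Return (rows, responses, beta) for model (1.a); requires p >= 22."""
    rng = random.Random(seed)
    c = (2.0, 0.5, 3.0, 2.0)
    beta = [draw_beta(n, rng) for _ in range(4)]  # assumption: fixed within a replication
    rows, responses = [], []
    for _ in range(n):
        x = draw_predictors(p, rho, rng)
        # X_1, X_2, X_12 and X_22 are 1-based in the text, hence the index shift.
        y = (c[0] * beta[0] * x[0]
             + c[1] * beta[1] * x[1]
             + c[2] * beta[2] * (1.0 if x[11] < 0 else 0.0)
             + c[3] * beta[3] * x[21]
             + rng.gauss(0.0, 1.0))
        rows.append(x)
        responses.append(y)
    return rows, responses, beta


rows, responses, beta = simulate_model_1a(n=20, p=30, rho=0.5, seed=1)
```

The recursion gives each $X_j$ unit variance and correlation $\rho^{|i-j|}$ between $X_i$ and $X_j$, which matches the covariance structure described in this section without forming the p-by-p matrix.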

Tables 1 and 2 depict the simulation results for $S$, $\mathcal{P}_s$ and $\mathcal{P}_a$. The performances
of the DC-SIS, SIS and SIRS are quite similar in model (1.a), indicating that the SIS has a robust
performance if the working linear model does not deviate far from the underlying true model.
The DC-SIS outperforms the SIS and SIRS significantly in models (1.b), (1.c) and (1.d).
Both the SIS and SIRS have little chance to identify the important predictors $X_1$ and $X_2$ in
models (1.b) and (1.c), and $X_{22}$ in model (1.d).

Example 2. We illustrate that the DC-SIS can be directly used for screening grouped 
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Table 2: The proportions of V s and V a in Example 1. The user-specified model sizes d\ = 
[ra/logn], c?2 = 2[n/logn] and c?3 = 3[n/logn]. 





SIS 


SIRS 


DC-SIS 




v s 


v a 


Vs 


Va 


v s 


v a 


model 


size 




x 2 




A" 22 


ALL 


Xj 


x 2 


Xl2 


X22 


ALL 


Xj 


x 2 


Xl 2 


A" 22 


ALL 


case 1: 


p = 2000 and aij 


= 0.5 1 ' 


-j\ 


























di 


1.00 


1.00 


0.96 


1.00 


0.96 


1.00 


1.00 


0.95 


1.00 


0.94 


1.00 


1.00 


0.97 


1.00 


0.96 


(l.a) 


d 2 


1.00 


1.00 


0.98 


1.00 


0.97 


1.00 


1.00 


0.96 


1.00 


0.96 


1.00 


1.00 


0.98 


1.00 


0.98 




d 3 


1.00 


1.00 


0.98 


1.00 


0.98 


1.00 


1.00 


0.97 


1.00 


0.97 


1.00 


1.00 


0.99 


1.00 


0.98 




di 


0.08 


0.07 


0.97 


1.00 


0.03 


0.02 


0.03 


0.98 


1.00 


0.00 


0.72 


0.70 


0.99 


1.00 


0.58 


(l.b) 


d2 


0.12 


0.13 


0.98 


1.00 


0.06 


0.05 


0.05 


0.99 


1.00 


0.01 


0.85 


0.84 


1.00 


1.00 


0.76 




d 3 


0.15 


0.17 


0.99 


1.00 


0.07 


0.06 


0.06 


0.99 


1.00 


0.01 


0.89 


0.88 


1.00 


1.00 


0.82 




di 


0.12 


0.13 


0.01 


0.99 


0.00 


0.04 


0.03 


0.51 


1.00 


0.01 


0.93 


0.93 


0.77 


1.00 


0.65 


(l.c) 


d 2 


0.17 


0.18 


0.03 


0.99 


0.00 


0.07 


0.05 


0.67 


1.00 


0.01 


0.97 


0.96 


0.84 


1.00 


0.79 




d 3 


0.21 


0.21 


0.05 


0.99 


0.00 


0.09 


0.08 


0.75 


1.00 


0.02 


0.98 


0.97 


0.89 


1.00 


0.84 




di 


0.42 


0.22 


0.14 


0.42 


0.02 


1.00 


0.98 


0.87 


0.05 


0.04 


1.00 


0.91 


0.81 


0.99 


0.73 


(l.d) 


d 2 


0.48 


0.29 


0.22 


0.50 


0.03 


1.00 


0.99 


0.91 


0.10 


0.09 


1.00 


0.94 


0.87 


1.00 


0.82 




d 3 


0.56 


0.32 


0.26 


0.54 


0.04 


1.00 


0.99 


0.93 


0.12 


0.11 


1.00 


0.96 


0.92 


1.00 


0.88 


case 2: 


p = 2000 and 


= 0.8 1 ' 


-j\ 


























di 


1.00 


1.00 


0.63 


1.00 


0.63 


1.00 


1.00 


0.62 


1.00 


0.62 


1.00 


1.00 


0.78 


1.00 


0.77 


(l.a) 


d 2 


1.00 


1.00 


0.71 


1.00 


0.72 


1.00 


1.00 


0.70 


1.00 


0.69 


1.00 


1.00 


0.84 


1.00 


0.84 




d 3 


1.00 


1.00 


0.77 


1.00 


0.78 


1.00 


1.00 


0.75 


1.00 


0.75 


1.00 


1.00 


0.86 


1.00 


0.86 




di 


0.12 


0.13 


0.81 


1.00 


0.06 


0.04 


0.04 


0.88 


1.00 


0.02 


0.97 


0.98 


0.92 


1.00 


0.88 


(l.b) 


d 2 


0.19 


0.19 


0.86 


1.00 


0.12 


0.07 


0.07 


0.91 


1.00 


0.03 


0.99 


0.99 


0.95 


1.00 


0.94 




d 3 


0.22 


0.23 


0.88 


1.00 


0.15 


0.09 


0.11 


0.93 


1.00 


0.06 


1.00 


0.99 


0.96 


1.00 


0.96 




di 


0.17 


0.16 


0.03 


0.99 


0.00 


0.04 


0.04 


0.53 


1.00 


0.02 


1.00 


1.00 


0.75 


1.00 


0.75 


(l.c) 


d 2 


0.22 


0.22 


0.06 


1.00 


0.01 


0.08 


0.08 


0.71 


1.00 


0.03 


1.00 


1.00 


0.85 


1.00 


0.86 




d 3 


0.27 


0.27 


0.10 


1.00 


0.03 


0.10 


0.10 


0.81 


1.00 


0.05 


1.00 


1.00 


0.90 


1.00 


0.90 




di 


0.44 


0.38 


0.11 


0.45 


0.03 


1.00 


1.00 


0.73 


0.05 


0.04 


0.99 


0.98 


0.68 


1.00 


0.67 


(l.d) 


d 2 


0.51 


0.46 


0.18 


0.53 


0.05 


1.00 


1.00 


0.81 


0.09 


0.08 


1.00 


0.98 


0.76 


1.00 


0.75 




d 3 


0.55 


0.49 


0.22 


0.57 


0.06 


1.00 


1.00 


0.84 


0.14 


0.11 


1.00 


0.99 


0.80 


1.00 


0.80 


case 3: 


p = 5000 and cr^ 


= 0.5l ! 


-i\ 


























di 


1.00 


1.00 


0.94 


1.00 


0.94 


1.00 


0.99 


0.92 


1.00 


0.92 


1.00 


0.99 


0.96 


1.00 


0.95 


(l.a) 


d 2 


1.00 


1.00 


0.95 


1.00 


0.95 


1.00 


1.00 


0.95 


1.00 


0.95 


1.00 


1.00 


0.97 


1.00 


0.97 




d 3 


1.00 


1.00 


0.96 


1.00 


0.96 


1.00 


1.00 


0.96 


1.00 


0.96 


1.00 


1.00 


0.98 


1.00 


0.98 




di 


0.06 


0.06 


0.94 


1.00 


0.02 


0.02 


0.02 


0.96 


1.00 


0.00 


0.59 


0.60 


0.98 


1.00 


0.46 


(l.b) 


d 2 


0.09 


0.09 


0.96 


1.00 


0.03 


0.03 


0.03 


0.97 


1.00 


0.01 


0.72 


0.72 


0.99 


1.00 


0.61 




d 3 


0.12 


0.10 


0.97 


1.00 


0.04 


0.05 


0.04 


0.98 


1.00 


0.01 


0.79 


0.78 


0.99 


1.00 


0.68 




di 


0.06 


0.06 


0.01 


0.99 


0.00 


0.03 


0.02 


0.30 


1.00 


0.00 


0.86 


0.87 


0.61 


1.00 


0.41 


(l.c) 


d 2 


0.10 


0.10 


0.02 


1.00 


0.00 


0.04 


0.03 


0.45 


1.00 


0.00 


0.92 


0.93 


0.69 


1.00 


0.57 




d 3 


0.12 


0.12 


0.02 


1.00 


0.00 


0.05 


0.05 


0.53 


1.00 


0.00 


0.94 


0.95 


0.73 


1.00 


0.64 




di 


0.39 


0.21 


0.11 


0.40 


0.01 


1.00 


0.97 


0.82 


0.02 


0.02 


0.99 


0.87 


0.74 


0.99 


0.65 


(l.d) 


d 2 


0.44 


0.24 


0.14 


0.45 


0.01 


1.00 


0.98 


0.88 


0.04 


0.03 


0.99 


0.90 


0.81 


0.99 


0.75 




d 3 


0.48 


0.28 


0.17 


0.47 


0.02 


1.00 


0.99 


0.90 


0.06 


0.05 


0.99 


0.92 


0.85 


1.00 


0.79 


Case 4: $p = 5000$ and $\sigma_{ij} = 0.8^{|i-j|}$

(1.a)  d1   1.00 1.00 0.55 1.00 0.55   1.00 1.00 0.55 1.00 0.55   1.00 1.00 0.70 1.00 0.69
       d2   1.00 1.00 0.61 1.00 0.62   1.00 1.00 0.61 1.00 0.61   1.00 1.00 0.76 1.00 0.76
       d3   1.00 1.00 0.67 1.00 0.67   1.00 1.00 0.64 1.00 0.64   1.00 1.00 0.80 1.00 0.80

(1.b)  d1   0.10 0.09 0.74 1.00 0.05   0.02 0.02 0.83 1.00 0.00   0.94 0.94 0.90 1.00 0.82
       d2   0.12 0.13 0.81 1.00 0.07   0.03 0.04 0.87 1.00 0.01   0.97 0.97 0.93 1.00 0.89
       d3   0.15 0.16 0.84 1.00 0.10   0.05 0.06 0.90 1.00 0.02   0.98 0.98 0.95 1.00 0.92

(1.c)  d1   0.10 0.10 0.02 0.98 0.00   0.02 0.03 0.34 1.00 0.00   1.00 1.00 0.64 1.00 0.63
       d2   0.13 0.14 0.04 0.99 0.01   0.04 0.04 0.50 1.00 0.01   1.00 1.00 0.74 1.00 0.74
       d3   0.16 0.18 0.05 0.99 0.01   0.05 0.05 0.61 1.00 0.02   1.00 1.00 0.79 1.00 0.79

(1.d)  d1   0.42 0.32 0.09 0.40 0.01   1.00 1.00 0.66 0.02 0.01   0.99 0.97 0.63 0.98 0.59
       d2   0.48 0.39 0.12 0.44 0.02   1.00 1.00 0.74 0.04 0.03   0.99 0.97 0.70 1.00 0.68
       d3   0.51 0.42 0.15 0.46 0.02   1.00 1.00 0.78 0.05 0.04   0.99 0.98 0.73 1.00 0.71
predictors. In many regression problems, some predictors can be naturally grouped. The
most common example involving grouped variables is the multi-factor ANOVA problem,
in which each factor may have several levels and can be expressed through a group of dummy
variables. The goal of ANOVA is to select important main effects and interactions for
accurate prediction, which amounts to selecting groups of dummy variables. To
demonstrate the practicability of the DC-SIS, we adopt the following model:

$$Y = c_1\beta_1 X_1 + c_2\beta_2 X_2 + c_3\beta_3\{1(X_{12} < q_1) + 1.5 \times 1(q_1 \le X_{12} < q_2) + 2 \times 1(q_2 \le X_{12} < q_3)\} + c_4\beta_4 X_{22} + \varepsilon,$$

where $q_1$, $q_2$ and $q_3$ are the 25%, 50% and 75% quantiles of $X_{12}$, respectively. The variables
$X$ and the coefficients $c_i$'s and $\beta_i$'s are the same as those in Example 1. We write

$$\mathbf{x}_{12} = \{1(X_{12} < q_1),\ 1(q_1 \le X_{12} < q_2),\ 1(q_2 \le X_{12} < q_3)\}^T.$$

These three correlated variables naturally form a group. The predictor vector in this
example becomes $\mathbf{x} = (X_1, \ldots, X_{11}, \mathbf{x}_{12}, X_{13}, \ldots, X_p)^T \in \mathbb{R}^{p+2}$. We remark here that the
marginal utility of the grouped variable $\mathbf{x}_{12}$ is defined by

$$\hat\omega_{12} = \widehat{\mathrm{dcorr}}^2(\mathbf{x}_{12}, Y).$$

The 5%, 25%, 50%, 75% and 95% percentiles of the minimum model size $S$ are summarized in
Table 3. These percentiles indicate that, with very high probability, the minimum model size
$S$ needed to include all active predictors is small. Note that $[n/\log(n)] = 37$. Thus,
almost all $P_s$'s and $P_a$'s equal 100%. All active predictors, including the grouped variable
$\mathbf{x}_{12}$, can be selected almost perfectly into the resulting model across all three model
sizes. Hence, the DC-SIS is effective at selecting grouped predictors.
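As a rough illustration of how the marginal utility of a grouped predictor can be computed, the sketch below evaluates the squared sample distance correlation between a three-column indicator group and a response. It assumes NumPy; the function names `dcorr2` and `group_indicators`, the simulated data and the effect sizes are hypothetical choices made for illustration, not the authors' code or their simulation settings.

```python
import numpy as np


def _centered_dist(z):
    """Double-centered Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a vector as n observations of one variable
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcorr2(u, v):
    """Squared sample distance correlation; u and v may be multivariate."""
    a, b = _centered_dist(u), _centered_dist(v)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else float((a * b).mean() / denom)


def group_indicators(x, probs=(0.25, 0.5, 0.75)):
    """Three dummy variables cut at the sample quartiles of x."""
    q1, q2, q3 = np.quantile(x, probs)
    cols = [x < q1, (q1 <= x) & (x < q2), (q2 <= x) & (x < q3)]
    return np.column_stack(cols).astype(float)


# Hypothetical data: the response depends on x12 only through the group.
rng = np.random.default_rng(0)
x12 = rng.standard_normal(200)
group = group_indicators(x12)
y = group @ np.array([1.0, 1.5, 2.0]) + rng.standard_normal(200)

# The whole group is scored once, as a single 3-dimensional predictor.
omega_12 = dcorr2(group, y)
```

Treating the three dummies as one multivariate argument is what lets the group enter the ranking as a single unit rather than as three separate columns.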
Table 3: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S out of 500
replications in Example 2.

                                          p = 2000                            p = 5000
S                             5%    25%   50%   75%   95%         5%    25%   50%   75%   95%
$\sigma_{ij} = 0.5^{|i-j|}$   4.0   4.0   4.0   5.0   12.0        4.0   4.0   4.0   6.0   16.1
$\sigma_{ij} = 0.8^{|i-j|}$   4.0   5.0   7.0   9.0   15.2        4.0   5.0   7.0   9.0   21.0



Example 3. In this example, we investigate the performance of the DC-SIS with multivariate
responses. The SIS proposed in Fan and Lv (2008) cannot be directly applied in such
settings. In contrast, the DC-SIS is ready for screening the active predictors by the nature
of DC. In this example, we generate $\mathbf{y} = (Y_1, Y_2)^T$ from a normal distribution with mean zero
and covariance matrix $\Sigma_{y|x} = (\sigma_{x,ij})_{2\times 2}$, where $\sigma_{x,11} = \sigma_{x,22} = 1$ and $\sigma_{x,12} = \sigma_{x,21} = \sigma(\mathbf{x})$.
We consider two scenarios for the correlation function $\sigma(\mathbf{x})$:

(3.a): $\sigma(\mathbf{x}) = \sin(\beta_1^T\mathbf{x})$, where $\beta_1 = (0.8, 0.6, 0, \ldots, 0)^T$.

(3.b): $\sigma(\mathbf{x}) = \{\exp(\beta_2^T\mathbf{x}) - 1\}/\{\exp(\beta_2^T\mathbf{x}) + 1\}$, where $\beta_2 = (2 - U_1, 2 - U_2, 2 - U_3, 2 - U_4, 0, \ldots, 0)^T$
with the $U_i$'s being independent and identically distributed according to the uniform
distribution Uniform[0, 1].

Tables 4 and 5 report the simulation results. Table 4 implies that the DC-SIS performs
reasonably well for both models (3.a) and (3.b) in terms of model complexity. Table 5
indicates that the proportions with which the active predictors are selected into the model are close
to one, which supports the assertion that the DC-SIS possesses the sure screening property. It
implies that the DC-SIS can identify the active predictors contained in correlations between
multivariate responses. This may be potentially useful in gene co-expression analysis.
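The data-generating mechanism of this example can be sketched as follows. This is a minimal illustration assuming NumPy; the function names, the small dimension used in the demonstration, and the choice to draw the predictors from a normal distribution with $\sigma_{ij} = \rho^{|i-j|}$ are assumptions made here for concreteness and are not taken from the authors' code.

```python
import numpy as np


def ar1_design(n, p, rho, rng):
    """Draw n rows from N(0, Sigma) with Sigma[i, j] = rho ** |i - j| (assumed design)."""
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(sigma)
    return rng.standard_normal((n, p)) @ chol.T


def corr_3a(x):
    """Scenario (3.a): sigma(x) = sin(beta_1' x)."""
    beta1 = np.zeros(x.shape[1])
    beta1[:2] = [0.8, 0.6]
    return np.sin(x @ beta1)


def corr_3b(x, rng):
    """Scenario (3.b): sigma(x) = {exp(t) - 1} / {exp(t) + 1} with t = beta_2' x."""
    beta2 = np.zeros(x.shape[1])
    beta2[:4] = 2.0 - rng.uniform(0.0, 1.0, size=4)
    # tanh(t / 2) is algebraically the same ratio and avoids overflow in exp.
    return np.tanh(0.5 * (x @ beta2))


def draw_responses(corr, rng):
    """Bivariate normal rows with unit variances and row-specific correlation."""
    z1 = rng.standard_normal(corr.shape[0])
    z2 = rng.standard_normal(corr.shape[0])
    y2 = corr * z1 + np.sqrt(1.0 - corr ** 2) * z2
    return np.column_stack([z1, y2])


# Small demonstration (n and p here are illustrative, not the paper's settings).
rng = np.random.default_rng(0)
x = ar1_design(n=200, p=10, rho=0.5, rng=rng)
y_a = draw_responses(corr_3a(x), rng)
y_b = draw_responses(corr_3b(x, rng), rng)
```

Note that the predictors affect only the correlation between $Y_1$ and $Y_2$, not their means or variances, so each response is marginally standard normal and a screening rule applied to either response alone has nothing to detect.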



Example 4. The Cardiomyopathy microarray dataset was analyzed by Segal, Dahlquist
and Conklin (2003) and Hall and Miller (2009). The goal is to identify the most influential
genes for overexpression of a G protein-coupled receptor (Ro1) in mice. The response $Y$ is
the Ro1 expression level, and the predictors $X_k$'s are other gene expression levels. Compared
Table 4: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S out of 500
replications in Example 3.

                                                 p = 2000                                  p = 5000
S                             Model   5%     25%    50%    75%    95%          5%     25%    50%    75%    95%
$\sigma_{ij} = 0.5^{|i-j|}$   (3.a)   4.0    9.0    18.0   39.3   112.3        6.0    22.0   48.0   95.3   296.4
                              (3.b)   6.0    19.0   43.0   92.0   253.1        14.0   45.0   92.5   198.8  571.6
$\sigma_{ij} = 0.8^{|i-j|}$   (3.a)   2.0    3.0    6.0    12.0   40.0         2.0    6.0    14.0   32.0   98.0
                              (3.b)   4.0    4.0    4.0    6.0    10.0         4.0    4.0    5.0    8.0    18.1



Table 5: The proportions of $P_s$ and $P_a$ in Example 3. The user-specified model sizes are $d_1 = [n/\log n]$,
$d_2 = 2[n/\log n]$ and $d_3 = 3[n/\log n]$.

                                    p = 2000                                    p = 5000
                                    (3.a)            (3.b)                      (3.a)            (3.b)
                                    P_s       P_a    P_s                 P_a    P_s       P_a    P_s                 P_a
                              size  X1   X2   ALL    X1   X2   X3   X4   ALL    X1   X2   ALL    X1   X2   X3   X4   ALL
$\sigma_{ij} = 0.5^{|i-j|}$   d1    0.95 0.76 0.74   0.71 0.98 0.98 0.72 0.47   0.79 0.49 0.42   0.48 0.91 0.90 0.53 0.20
                              d2    0.98 0.90 0.90   0.85 0.99 0.99 0.85 0.71   0.93 0.70 0.67   0.67 0.97 0.97 0.71 0.45
                              d3    1.00 0.95 0.95   0.91 0.99 1.00 0.90 0.81   0.97 0.81 0.80   0.75 0.98 0.99 0.78 0.55
$\sigma_{ij} = 0.8^{|i-j|}$   d1    0.98 0.95 0.94   1.00 1.00 1.00 1.00 1.00   0.92 0.84 0.81   1.00 1.00 1.00 0.99 0.99
                              d2    1.00 0.98 0.99   1.00 1.00 1.00 1.00 1.00   0.98 0.95 0.93   1.00 1.00 1.00 1.00 1.00
                              d3    1.00 1.00 1.00   1.00 1.00 1.00 1.00 1.00   0.99 0.96 0.96   1.00 1.00 1.00 1.00 1.00



with the sample size $n = 30$ in this dataset, the dimension $p = 6319$ is very large.

The DC-SIS procedure ranks two genes, labeled Msa.2134.0 and Msa.2877.0, at the top.
The scatter plots of $Y$ versus these two gene expression levels with cubic spline fit curves
in Figure 1 clearly indicate the existence of nonlinear patterns. Yet, our finding differs
from that of Hall and Miller (2009), who ranked Msa.2877.0 and Msa.1166.0 at the top with
their proposed generalized correlation ranking. A natural question arises: which screening
procedure performs better in terms of ranking? To compare the performance of these two
procedures, we fit an additive model as follows:

$$Y = \ell_{k1}(X_{k1}) + \ell_{k2}(X_{k2}) + \varepsilon_k, \quad \text{for } k = 1, 2.$$

The DC-SIS, corresponding to $k = 1$, regards Msa.2134.0 and Msa.2877.0 as the two
predictors, while the generalized correlation ranking proposed by Hall and Miller (2009),
corresponding to $k = 2$, regards Msa.2877.0 and Msa.1166.0 as predictors in the above model. We
fit the unknown link functions $\ell_{ki}$ using the R mgcv package. The DC-SIS method clearly
achieves better performance, with an adjusted $R^2$ of 96.8% and a deviance explained of
98.3%, in contrast to an adjusted $R^2$ of 84.5% and a deviance explained of 86.6% for the
generalized correlation ranking method. We remark here that deviance explained means the
proportion of the null deviance explained by the proposed model, with a larger value
indicating better performance. Because both the adjusted $R^2$ value and the explained deviance
are very large for the DC-SIS, it seems unnecessary to extract any additional genes.




Figure 1. Scatter plots of Y versus the expression levels of the two genes identified by the
DC-SIS (panels: Msa.2877.0 and Msa.2134.0).



4. DISCUSSION 



In this paper we proposed a sure independence screening procedure using distance
correlation. We established the sure screening property for this procedure when the number of
predictors diverges at an exponential rate of the sample size. We examined the finite-sample
performance of the proposed procedure via Monte Carlo studies and illustrated the
proposed methodology through a real data example. We followed Fan and Lv (2008) to set
the cutoff $d$ in this paper and examined the effect of different values of $d$. As pointed out
by a referee, the choice of $d$ is very important in the screening stage. Zhao and Li (2012)
proposed an approach to selecting $d$ for Cox models based on controlling the false positive rate.
Their approach applies only to model-based feature screening methods. Zhu, Li, Li and Zhu
(2011) proposed an alternative method to determine $d$ for the SIRS. One may adopt their
procedure for the DC-SIS. We opt not to pursue this further. Certainly, the selection of
$d$ is similar to the selection of the tuning parameter in regularization methods, and plays an
important role in practical implementation. This is a good topic for future research.

Similar to the SIS, the DC-SIS may fail to identify some important predictors which
are marginally independent of the response. Thus, it is of interest to develop an iterative
procedure to fix such an issue. In an earlier version of this paper, we proposed an iterative
version of the DC-SIS. Our empirical studies, including Monte Carlo simulation and real data
analysis, suggest that the proposed iterative DC-SIS may be used to fix the problem in a
spirit similar to that of the ISIS (Fan and Lv, 2008). Theoretical analysis of the iterative DC-SIS needs further
study. New methods for identifying important predictors which are marginally
independent of the response are an important topic for future research.

APPENDIX 

Appendix A: Some Lemmas 

Lemmas 1 and 2 will be used repeatedly in the proof of Theorem 1. These two lemmas
provide us with two exponential inequalities, and are extracted from Lemma 5.6.1.A and Theorem
5.6.1.A of Serfling (1980, pages 200-201).

Lemma 1. Let $\mu = E(Y)$. If $\Pr(a \le Y \le b) = 1$, then

$$E[\exp\{s(Y - \mu)\}] \le \exp\{s^2(b - a)^2/8\}, \quad \text{for any } s > 0.$$

Lemma 2. Let $h(Y_1, \ldots, Y_m)$ be a kernel of the U-statistic $U_n$, and $\theta = E\{h(Y_1, \ldots, Y_m)\}$.
If $a \le h(Y_1, \ldots, Y_m) \le b$, then, for any $t > 0$ and $n \ge m$,

$$\Pr(U_n - \theta \ge t) \le \exp\{-2[n/m]t^2/(b - a)^2\},$$

where $[n/m]$ denotes the integer part of $n/m$.

Due to the symmetry of the U-statistic, Lemma 2 entails that

$$\Pr(|U_n - \theta| \ge t) \le 2\exp\{-2[n/m]t^2/(b - a)^2\}.$$
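For readers who want the symmetry step spelled out, the following short derivation is one way to obtain the two-sided bound from Lemma 2. It is a routine argument added here for convenience and uses only the notation of the lemma.

```latex
% The kernel -h satisfies -b <= -h <= -a, has the same range b - a,
% and its U-statistic is -U_n with mean -theta. Lemma 2 applied to -h gives
\begin{align*}
\Pr(U_n - \theta \le -t)
  &= \Pr\{(-U_n) - (-\theta) \ge t\}
   \le \exp\{-2[n/m]t^2/(b-a)^2\}, \\
% and the union bound over the two tails yields
\Pr(|U_n - \theta| \ge t)
  &\le \Pr(U_n - \theta \ge t) + \Pr(U_n - \theta \le -t)
   \le 2\exp\{-2[n/m]t^2/(b-a)^2\}.
\end{align*}
```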

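Let us introduce some notation before giving the proof of Theorem 1. Let $\{\tilde X_k, \tilde{\mathbf{y}}\}$
be an independent copy of $\{X_k, \mathbf{y}\}$, and define $S_{k1} = E\{\|X_k - \tilde X_k\|_1\|\mathbf{y} - \tilde{\mathbf{y}}\|_q\}$,
$S_{k2} = E\|X_k - \tilde X_k\|_1 E\|\mathbf{y} - \tilde{\mathbf{y}}\|_q$, and $S_{k3} = E\{E(\|X_k - \tilde X_k\|_1 \mid X_k)E(\|\mathbf{y} - \tilde{\mathbf{y}}\|_q \mid \mathbf{y})\}$, and their sample
counterparts

$$\hat S_{k1} = \frac{1}{n^2}\sum_{i,j=1}^{n}\|X_{ik} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q,$$

$$\hat S_{k2} = \frac{1}{n^2}\sum_{i,j=1}^{n}\|X_{ik} - X_{jk}\|_1 \cdot \frac{1}{n^2}\sum_{i,j=1}^{n}\|\mathbf{y}_i - \mathbf{y}_j\|_q,$$

$$\hat S_{k3} = \frac{1}{n^3}\sum_{i,j,l=1}^{n}\|X_{ik} - X_{lk}\|_1\|\mathbf{y}_j - \mathbf{y}_l\|_q.$$

By the definitions of distance covariance and sample distance covariance, it follows that

$$\mathrm{dcov}^2(X_k, \mathbf{y}) = S_{k1} + S_{k2} - 2S_{k3} \quad\text{and}\quad \widehat{\mathrm{dcov}}^2(X_k, \mathbf{y}) = \hat S_{k1} + \hat S_{k2} - 2\hat S_{k3}.$$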
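The three-sum form of the sample distance covariance can be checked numerically against the familiar double-centered-matrix form. The sketch below assumes NumPy; the function names and the simulated data are hypothetical and serve only to illustrate the identity.

```python
import numpy as np


def pairwise_dist(z):
    """Euclidean distance matrix between the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    return np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))


def dcov2_from_sums(x, y):
    """Squared sample distance covariance as S1 + S2 - 2 * S3."""
    a, b = pairwise_dist(x), pairwise_dist(y)
    s1 = (a * b).mean()                            # n^-2 sum_{i,j} a_ij * b_ij
    s2 = a.mean() * b.mean()                       # product of the two grand means
    s3 = (a.mean(axis=0) * b.mean(axis=0)).mean()  # n^-3 sum_{i,j,l} a_il * b_jl
    return s1 + s2 - 2.0 * s3


def dcov2_double_centered(x, y):
    """Same quantity via double-centered distance matrices."""
    def center(d):
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    return (center(pairwise_dist(x)) * center(pairwise_dist(y))).mean()


# Hypothetical data: a scalar predictor and a bivariate response.
rng = np.random.default_rng(0)
x = rng.standard_normal(40)
y = rng.standard_normal((40, 2)) + x[:, None] ** 2
same = np.isclose(dcov2_from_sums(x, y), dcov2_double_centered(x, y))
```

The triple sum in the third term never has to be formed explicitly: summing over $i$ and $j$ first reduces it to an average over $l$ of products of column means, which keeps the cost at $O(n^2)$.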



Appendix B: Proof of Theorem 1

We aim to show the uniform consistency of the denominator and the numerator of $\hat\omega_k$
under the regularity conditions, respectively. Because the denominator of $\hat\omega_k$ has a form similar to
that of the numerator, we deal only with the numerator below. Throughout the proof,
$C$ and $c$ denote generic constants which may take different values at each appearance.

We first deal with $\hat S_{k1}$. Define $\hat S^*_{k1} = \{n(n-1)\}^{-1}\sum_{i \ne j}\|X_{ik} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q$, which is a
usual U-statistic. We shall establish the uniform consistency of $\hat S^*_{k1}$ by using the theory of
U-statistics (Serfling, 1980, Section 5). By the Cauchy-Schwarz inequality,

$$S_{k1} = E(\|X_{ik} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q) \le \{E(\|X_{ik} - X_{jk}\|_1^2)E(\|\mathbf{y}_i - \mathbf{y}_j\|_q^2)\}^{1/2} \le 4\{E(X_k^2)E\|\mathbf{y}\|_q^2\}^{1/2}.$$

This together with condition (C1) implies that $S_{k1}$ is uniformly bounded in $p$, that is,
$\sup_p \max_{1 \le k \le p} S_{k1} < \infty$. For any given $\varepsilon > 0$, take $n$ large enough such that $S_{k1}/n < \varepsilon$. Then it
can be easily shown that

$$\begin{aligned}
\Pr(|\hat S_{k1} - S_{k1}| \ge 2\varepsilon) &= \Pr\{|\hat S^*_{k1}(n-1)/n - S_{k1}(n-1)/n - S_{k1}/n| \ge 2\varepsilon\} \\
&\le \Pr\{|\hat S^*_{k1} - S_{k1}|(n-1)/n \ge 2\varepsilon - S_{k1}/n\} \\
&\le \Pr(|\hat S^*_{k1} - S_{k1}| \ge \varepsilon).
\end{aligned} \tag{B.1}$$

To establish the uniform consistency of $\hat S_{k1}$, it thus suffices to show the uniform consistency
of $\hat S^*_{k1}$. Let $h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) = \|X_{ik} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q$ be the kernel of the U-statistic $\hat S^*_{k1}$.
We decompose the kernel function $h_1$ into two parts: $h_1 = h_1 1(h_1 > M) + h_1 1(h_1 \le M)$,
where $M$ will be specified later. The U-statistic can now be written as follows,

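$$\begin{aligned}
\hat S^*_{k1} &= \{n(n-1)\}^{-1}\sum_{i \ne j} h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)1\{h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) \le M\} \\
&\quad + \{n(n-1)\}^{-1}\sum_{i \ne j} h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)1\{h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) > M\} \\
&= \hat S^*_{k1,1} + \hat S^*_{k1,2}.
\end{aligned}$$

Accordingly, we decompose $S_{k1}$ into two parts:

$$\begin{aligned}
S_{k1} &= E[h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)1\{h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) \le M\}] \\
&\quad + E[h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)1\{h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) > M\}] \\
&= S_{k1,1} + S_{k1,2}.
\end{aligned}$$

Clearly, $\hat S^*_{k1,1}$ and $\hat S^*_{k1,2}$ are unbiased estimators of $S_{k1,1}$ and $S_{k1,2}$, respectively.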

We deal with the consistency of $\hat S^*_{k1,1}$ first. By Markov's inequality, for any $t > 0$,
we can obtain that

$$\Pr(\hat S^*_{k1,1} - S_{k1,1} \ge \varepsilon) \le \exp(-t\varepsilon)\exp(-tS_{k1,1})E\{\exp(t\hat S^*_{k1,1})\}.$$

Serfling (1980, Section 5.1.6) showed that any U-statistic can be represented as an average
of averages of independent and identically distributed (i.i.d.) random variables. That is,
$\hat S^*_{k1,1} = (n!)^{-1}\sum_{n!}\Omega_1(X_{1k}, \mathbf{y}_1; \ldots; X_{nk}, \mathbf{y}_n)$, where $\sum_{n!}$ denotes the summation over all possible
permutations of $(1, \ldots, n)$, and each $\Omega_1(X_{1k}, \mathbf{y}_1; \ldots; X_{nk}, \mathbf{y}_n)$ is an average of $m = [n/2]$
i.i.d. random variables (i.e., $\Omega_1 = m^{-1}\sum_r h_1^{(r)}1\{h_1^{(r)} \le M\}$). Since the exponential function is
convex, it follows from Jensen's inequality that, for $0 < t \le 2s_0$,

$$\begin{aligned}
E\{\exp(t\hat S^*_{k1,1})\} &= E\Big[\exp\Big\{t(n!)^{-1}\sum_{n!}\Omega_1(X_{1k}, \mathbf{y}_1; \ldots; X_{nk}, \mathbf{y}_n)\Big\}\Big] \\
&\le (n!)^{-1}\sum_{n!}E[\exp\{t\Omega_1(X_{1k}, \mathbf{y}_1; \ldots; X_{nk}, \mathbf{y}_n)\}] \\
&= E^m\{\exp(m^{-1}t h_1^{(r)}1\{h_1^{(r)} \le M\})\},
\end{aligned}$$

which together with Lemma 1 entails immediately that

$$\begin{aligned}
\Pr(\hat S^*_{k1,1} - S_{k1,1} \ge \varepsilon) &\le \exp(-t\varepsilon)E^m\{\exp(m^{-1}t[h_1^{(r)}1\{h_1^{(r)} \le M\} - S_{k1,1}])\} \\
&\le \exp\{-t\varepsilon + M^2t^2/(8m)\}.
\end{aligned}$$

By choosing $t = 4\varepsilon m/M^2$, we have $\Pr(\hat S^*_{k1,1} - S_{k1,1} \ge \varepsilon) \le \exp(-2\varepsilon^2 m/M^2)$. Therefore, by
the symmetry of the U-statistic, we can obtain easily that

$$\Pr(|\hat S^*_{k1,1} - S_{k1,1}| \ge \varepsilon) \le 2\exp(-2\varepsilon^2 m/M^2). \tag{B.2}$$
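The choice $t = 4\varepsilon m/M^2$ is the minimizer of the exponent in the preceding bound. The short calculation below, added here as a worked step, makes this explicit using only the quantities $\varepsilon$, $m$ and $M$ defined above.

```latex
% Minimize the exponent f(t) = -t*eps + M^2 t^2 / (8m) over t > 0.
\begin{align*}
f(t) &= -t\varepsilon + \frac{M^2 t^2}{8m}, \qquad
f'(t) = -\varepsilon + \frac{M^2 t}{4m} = 0
\;\Longrightarrow\; t^{*} = \frac{4\varepsilon m}{M^2}, \\
f(t^{*}) &= -\frac{4\varepsilon^2 m}{M^2}
          + \frac{M^2}{8m}\cdot\frac{16\varepsilon^2 m^2}{M^4}
        = -\frac{4\varepsilon^2 m}{M^2} + \frac{2\varepsilon^2 m}{M^2}
        = -\frac{2\varepsilon^2 m}{M^2}.
\end{align*}
```

Since $f$ is a convex quadratic in $t$, this stationary point is the global minimum, which gives the exponent $-2\varepsilon^2 m/M^2$ appearing in (B.2).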

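Next we show the consistency of $\hat S^*_{k1,2}$. By the Cauchy-Schwarz and Markov inequalities,

$$\begin{aligned}
S^2_{k1,2} &\le E\{h_1^2(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)\}\Pr\{h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) > M\} \\
&\le E\{h_1^2(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)\}E[\exp\{s'h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)\}]/\exp(s'M),
\end{aligned}$$

for any $s' > 0$. Using the fact that $(a^2 + b^2)/2 \ge (a + b)^2/4 \ge |ab|$, we have

$$\begin{aligned}
h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) &= \{(X_{ik} - X_{jk})^2(\mathbf{y}_i - \mathbf{y}_j)^T(\mathbf{y}_i - \mathbf{y}_j)\}^{1/2} \\
&\le 2\{(X_{ik}^2 + X_{jk}^2)(\|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2)\}^{1/2} \le \{(X_{ik}^2 + X_{jk}^2 + \|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2)^2\}^{1/2} \\
&= X_{ik}^2 + X_{jk}^2 + \|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2,
\end{aligned}$$

which yields that

$$\begin{aligned}
E[\exp\{s'h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j)\}] &\le E[\exp\{s'(X_{ik}^2 + X_{jk}^2 + \|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2)\}] \\
&\le E\{\exp(2s'X_{ik}^2)\}E\{\exp(2s'\|\mathbf{y}_i\|_q^2)\}.
\end{aligned}$$

The last inequality follows from the Cauchy-Schwarz inequality. If we choose $M = cn^{\gamma}$ for
$0 < \gamma < 1/2 - \kappa$, then $S_{k1,2} \le \varepsilon/2$ when $n$ is sufficiently large. Consequently,

$$\Pr(|\hat S^*_{k1,2} - S_{k1,2}| > \varepsilon) \le \Pr(|\hat S^*_{k1,2}| > \varepsilon/2). \tag{B.3}$$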

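It remains to bound the probability $\Pr(|\hat S^*_{k1,2}| > \varepsilon/2)$. We observe that the events satisfy

$$\{|\hat S^*_{k1,2}| > \varepsilon/2\} \subseteq \{X_{ik}^2 + \|\mathbf{y}_i\|_q^2 > M/2, \text{ for some } 1 \le i \le n\}. \tag{B.4}$$

To see this, suppose instead that $X_{ik}^2 + \|\mathbf{y}_i\|_q^2 \le M/2$ for all $1 \le i \le n$. This assumption
leads to a contradiction. To be precise, under this assumption, $h_1(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j) \le
X_{ik}^2 + X_{jk}^2 + \|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2 \le M$ for every pair, so every indicator $1\{h_1 > M\}$ vanishes.
Consequently, $|\hat S^*_{k1,2}| = 0$, which contradicts the event
$|\hat S^*_{k1,2}| > \varepsilon/2$. This verifies that the relation (B.4) is true.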

By invoking condition (C1), there must exist a constant $C$ such that

$$\Pr(\|X_k\|_1^2 + \|\mathbf{y}\|_q^2 \ge M/2) \le \Pr(\|X_k\|_1 \ge \sqrt{M}/2) + \Pr(\|\mathbf{y}\|_q \ge \sqrt{M}/2) \le 2C\exp(-sM/4).$$

The last inequality follows from Markov's inequality for $s > 0$. Consequently,

$$\max_{1 \le k \le p}\Pr(|\hat S^*_{k1,2}| > \varepsilon/2) \le n\max_{1 \le k \le p}\Pr(\|X_k\|_1^2 + \|\mathbf{y}\|_q^2 \ge M/2) \le 2nC\exp(-sM/4). \tag{B.5}$$

Recall that $M = cn^{\gamma}$. Combining the results (B.2), (B.3) and (B.5), we have

$$\Pr(|\hat S_{k1} - S_{k1}| \ge 4\varepsilon) \le 2\exp(-\varepsilon^2 n^{1-2\gamma}) + 2nC\exp(-sn^{\gamma}/4). \tag{B.6}$$

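In the sequel we turn to $\hat S_{k2}$. We write $\hat S_{k2} = \hat S_{k2,1}\hat S_{k2,2}$, where $\hat S_{k2,1} = n^{-2}\sum_{i,j=1}^{n}\|X_{ik} - X_{jk}\|_1$
and $\hat S_{k2,2} = n^{-2}\sum_{i,j=1}^{n}\|\mathbf{y}_i - \mathbf{y}_j\|_q$. Similarly, we write $S_{k2} = S_{k2,1}S_{k2,2}$, where $S_{k2,1} = E\{\|X_{ik} - X_{jk}\|_1\}$
and $S_{k2,2} = E\{\|\mathbf{y}_i - \mathbf{y}_j\|_q\}$. Following the arguments for proving (B.6), we can show that

$$\begin{aligned}
\Pr(|\hat S_{k2,1} - S_{k2,1}| \ge 4\varepsilon) &\le 2\exp(-\varepsilon^2 n^{1-2\gamma}) + 2nC\exp(-sn^{2\gamma}/4), \text{ and} \\
\Pr(|\hat S_{k2,2} - S_{k2,2}| \ge 4\varepsilon) &\le 2\exp(-\varepsilon^2 n^{1-2\gamma}) + 2nC\exp(-sn^{2\gamma}/4).
\end{aligned} \tag{B.7}$$

Condition (C1) ensures that $S_{k2,1} \le \{E(\|X_{ik} - X_{jk}\|_1^2)\}^{1/2} \le \{4E(X_k^2)\}^{1/2}$ and $S_{k2,2} \le
\{E(\|\mathbf{y}_i - \mathbf{y}_j\|_q^2)\}^{1/2} \le \{4E(\|\mathbf{y}\|_q^2)\}^{1/2}$ are uniformly bounded. That is,

$$\max\Big\{\max_{1 \le k \le p} S_{k2,1},\ S_{k2,2}\Big\} \le C,$$

for some constant $C$. Using (B.7) repeatedly, we can easily prove that

$$\begin{aligned}
\Pr\{|(\hat S_{k2,1} - S_{k2,1})S_{k2,2}| \ge \varepsilon\} &\le \Pr(|\hat S_{k2,1} - S_{k2,1}| \ge \varepsilon/C) \\
&\le 2\exp\{-\varepsilon^2 n^{1-2\gamma}/(16C^2)\} + 2nC\exp(-sn^{2\gamma}/4), \\
\Pr\{|S_{k2,1}(\hat S_{k2,2} - S_{k2,2})| \ge \varepsilon\} &\le \Pr(|\hat S_{k2,2} - S_{k2,2}| \ge \varepsilon/C) \\
&\le 2\exp\{-\varepsilon^2 n^{1-2\gamma}/(16C^2)\} + 2nC\exp(-sn^{2\gamma}/4),
\end{aligned} \tag{B.8}$$

and

$$\begin{aligned}
\Pr\{|(\hat S_{k2,1} - S_{k2,1})(\hat S_{k2,2} - S_{k2,2})| \ge \varepsilon\}
&\le \Pr(|\hat S_{k2,1} - S_{k2,1}| \ge \sqrt{\varepsilon}) + \Pr(|\hat S_{k2,2} - S_{k2,2}| \ge \sqrt{\varepsilon}) \\
&\le 4\exp(-\varepsilon n^{1-2\gamma}/16) + 4nC\exp(-sn^{2\gamma}/4).
\end{aligned} \tag{B.9}$$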
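It follows from Bonferroni's inequality and inequalities (B.8) and (B.9) that

$$\begin{aligned}
\Pr(|\hat S_{k2} - S_{k2}| \ge 3\varepsilon) &= \Pr(|\hat S_{k2,1}\hat S_{k2,2} - S_{k2,1}S_{k2,2}| \ge 3\varepsilon) \\
&\le \Pr\{|(\hat S_{k2,1} - S_{k2,1})S_{k2,2}| \ge \varepsilon\} + \Pr\{|S_{k2,1}(\hat S_{k2,2} - S_{k2,2})| \ge \varepsilon\} \\
&\quad + \Pr\{|(\hat S_{k2,1} - S_{k2,1})(\hat S_{k2,2} - S_{k2,2})| \ge \varepsilon\} \\
&\le 8\exp\{-\varepsilon^2 n^{1-2\gamma}/(16C^2)\} + 8nC\exp(-sn^{2\gamma}/4),
\end{aligned} \tag{B.10}$$

where the last inequality holds when $\varepsilon$ is sufficiently small and $C$ is sufficiently large.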
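It remains to show the uniform consistency of $\hat S_{k3}$. We first study the following U-statistic:

$$\begin{aligned}
\hat S^*_{k3} &= \frac{1}{n(n-1)(n-2)}\sum_{i<j<l}\big\{\|X_{ik} - X_{jk}\|_1\|\mathbf{y}_j - \mathbf{y}_l\|_q + \|X_{ik} - X_{lk}\|_1\|\mathbf{y}_j - \mathbf{y}_l\|_q \\
&\qquad + \|X_{ik} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_l\|_q + \|X_{lk} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_l\|_q \\
&\qquad + \|X_{lk} - X_{jk}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q + \|X_{lk} - X_{ik}\|_1\|\mathbf{y}_i - \mathbf{y}_j\|_q\big\} \\
&= \frac{6}{n(n-1)(n-2)}\sum_{i<j<l}h_3(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j; X_{lk}, \mathbf{y}_l).
\end{aligned} \tag{B.11}$$

Here, $h_3(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j; X_{lk}, \mathbf{y}_l)$ is the kernel of the U-statistic $\hat S^*_{k3}$. Following the arguments
used to deal with $\hat S^*_{k1}$, we decompose $h_3$ into two parts: $h_3 = h_3 1(h_3 > M) + h_3 1(h_3 \le M)$.
Accordingly,

$$\begin{aligned}
\hat S^*_{k3} &= \frac{6}{n(n-1)(n-2)}\sum_{i<j<l}h_3 1(h_3 \le M) + \frac{6}{n(n-1)(n-2)}\sum_{i<j<l}h_3 1(h_3 > M) = \hat S^*_{k3,1} + \hat S^*_{k3,2}, \\
S_{k3} &= E\{h_3 1(h_3 \le M)\} + E\{h_3 1(h_3 > M)\} = S_{k3,1} + S_{k3,2}.
\end{aligned}$$

Following arguments similar to those used for proving (B.2), we can show that

$$\Pr(|\hat S^*_{k3,1} - S_{k3,1}| \ge \varepsilon) \le 2\exp(-2\varepsilon^2 m'/M^2), \tag{B.12}$$

where $m' = [n/3]$ because $\hat S^*_{k3,1}$ is a third-order U-statistic.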

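Then we deal with $\hat S^*_{k3,2}$. We observe that $h_3(X_{ik}, \mathbf{y}_i; X_{jk}, \mathbf{y}_j; X_{lk}, \mathbf{y}_l) \le 4(X_{ik}^2 + X_{jk}^2 +
X_{lk}^2 + \|\mathbf{y}_i\|_q^2 + \|\mathbf{y}_j\|_q^2 + \|\mathbf{y}_l\|_q^2)/6$, which will be smaller than $M$ if $X_{ik}^2 + \|\mathbf{y}_i\|_q^2 \le M/2$ for all
$1 \le i \le n$. Thus, for any $\varepsilon > 0$, the events satisfy

$$\{|\hat S^*_{k3,2}| > \varepsilon/2\} \subseteq \{X_{ik}^2 + \|\mathbf{y}_i\|_q^2 > M/2, \text{ for some } 1 \le i \le n\}.$$

By arguments similar to those used to prove (B.5), it follows that

$$\Pr(|\hat S^*_{k3,2} - S_{k3,2}| > \varepsilon) \le \Pr(|\hat S^*_{k3,2}| > \varepsilon/2) \le 2nC\exp(-sM/4). \tag{B.13}$$

Then, we combine the results (B.12) and (B.13) with $M = cn^{\gamma}$ for some $0 < \gamma < 1/2 - \kappa$ to
obtain that

$$\Pr(|\hat S^*_{k3} - S_{k3}| \ge 2\varepsilon) \le 2\exp(-2\varepsilon^2 n^{1-2\gamma}/3) + 2nC\exp(-sn^{\gamma}/4). \tag{B.14}$$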

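By the definition of $\hat S_{k3}$,

$$\hat S_{k3} = \frac{(n-1)(n-2)}{n^2}\Big\{\hat S^*_{k3} + \frac{1}{n-2}\hat S^*_{k1}\Big\}.$$

Thus, using techniques similar to those used to deal with $\hat S_{k1}$, we can obtain that

$$\Pr(|\hat S_{k3} - S_{k3}| \ge 4\varepsilon) = \Pr\Big\{\Big|\frac{(n-1)(n-2)}{n^2}(\hat S^*_{k3} - S_{k3}) - \frac{3n-2}{n^2}S_{k3} + \frac{n-1}{n^2}(\hat S^*_{k1} - S_{k1}) + \frac{n-1}{n^2}S_{k1}\Big| \ge 4\varepsilon\Big\}.$$

Using arguments similar to those for $S_{k1}$, we can show that $S_{k3}$ is uniformly bounded in
$p$. Taking $n$ large enough such that $\{(3n-2)/n^2\}S_{k3} \le \varepsilon$ and $\{(n-1)/n^2\}S_{k1} \le \varepsilon$, we have

$$\begin{aligned}
\Pr(|\hat S_{k3} - S_{k3}| \ge 4\varepsilon) &\le \Pr(|\hat S^*_{k3} - S_{k3}| \ge \varepsilon) + \Pr\{|\hat S^*_{k1} - S_{k1}| \ge \varepsilon\} \\
&\le 4\exp(-\varepsilon^2 n^{1-2\gamma}/6) + 4nC\exp(-sn^{\gamma}/4).
\end{aligned} \tag{B.15}$$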

The last inequality follows from (B.6) and (B.14). This, together with (B.6), (B.10) and
Bonferroni's inequality, implies

$$\begin{aligned}
&\Pr\{|(\hat S_{k1} + \hat S_{k2} - 2\hat S_{k3}) - (S_{k1} + S_{k2} - 2S_{k3})| \ge \varepsilon\} \\
&\quad \le \Pr(|\hat S_{k1} - S_{k1}| \ge \varepsilon/4) + \Pr(|\hat S_{k2} - S_{k2}| \ge \varepsilon/4) + \Pr(|\hat S_{k3} - S_{k3}| \ge \varepsilon/4) \\
&\quad = O\{\exp(-c_1\varepsilon^2 n^{1-2\gamma}) + n\exp(-c_2 n^{\gamma})\},
\end{aligned} \tag{B.16}$$

for some positive constants $c_1$ and $c_2$. The convergence rate of the numerator of $\hat\omega_k$ is
now achieved. Following similar arguments, we can obtain the convergence rate of the
denominator. In effect, the convergence rate of $\hat\omega_k$ has the same form as (B.16). We omit the
details here. Let $\varepsilon = cn^{-\kappa}$, where $\kappa$ satisfies $0 < \kappa + \gamma < 1/2$. We thus have

$$\begin{aligned}
\Pr\Big\{\max_{1 \le k \le p}|\hat\omega_k - \omega_k| > cn^{-\kappa}\Big\} &\le p\max_{1 \le k \le p}\Pr\{|\hat\omega_k - \omega_k| > cn^{-\kappa}\} \\
&\le O\big(p[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma})]\big).
\end{aligned}$$

The first part of Theorem 1 is proven.

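Now we deal with the second part of Theorem 1. If $\mathcal{D} \not\subseteq \widehat{\mathcal{D}}^*$, then there must exist some
$k \in \mathcal{D}$ such that $\hat\omega_k < cn^{-\kappa}$. It follows from condition (C2) that $|\hat\omega_k - \omega_k| > cn^{-\kappa}$ for some
$k \in \mathcal{D}$, indicating that the events satisfy $\{\mathcal{D} \not\subseteq \widehat{\mathcal{D}}^*\} \subseteq \{|\hat\omega_k - \omega_k| > cn^{-\kappa}, \text{ for some } k \in \mathcal{D}\}$,
and hence $\mathcal{E}_n = \{\max_{k \in \mathcal{D}}|\hat\omega_k - \omega_k| \le cn^{-\kappa}\} \subseteq \{\mathcal{D} \subseteq \widehat{\mathcal{D}}^*\}$. Consequently,

$$\begin{aligned}
\Pr(\mathcal{D} \subseteq \widehat{\mathcal{D}}^*) \ge \Pr(\mathcal{E}_n) &= 1 - \Pr(\mathcal{E}_n^c) = 1 - \Pr\Big(\max_{k \in \mathcal{D}}|\hat\omega_k - \omega_k| > cn^{-\kappa}\Big) \\
&\ge 1 - s_n\max_{k \in \mathcal{D}}\Pr\{|\hat\omega_k - \omega_k| > cn^{-\kappa}\} \\
&\ge 1 - O\big(s_n[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma})]\big),
\end{aligned}$$

where $s_n$ is the cardinality of $\mathcal{D}$. This completes the proof of the second part. $\Box$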